Add RISCV64 RVV backend #1037
Conversation
@mjosaarinen If you have
Note that on the "fastntt3" branch, there are layer-merged implementations of the NTT and INTT that are highly amenable to auto-vectorization with compilers like GCC 14. Benchmarks of that code on an RV64v target were encouraging, so it might provide some inspiration for a fully vectorized, hand-written back-end.
Yeah, you can easily double the speed with autovectorization alone, and some Google folks were of the opinion that they wanted to rely on that entirely in BoringSSL (RISC-V Android etc.), rather than maintain a hand-optimized version. The resulting code is pretty wild; I looked at that when considering RISC-V ISA extensions (see slide 17, for example, in https://mjos.fi/doc/20240325-rwc-riscv.pdf ). It was almost "too good" -- I suspect that Google has used those NTTs as a microbenchmark when developing LLVM autovectorizers :)
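As a rough illustration of the kind of loop that autovectorizes well (this is not the mlkem-native code; the function name, the plain `%`-based reduction, and the fixed modulus are all illustrative), a single Cooley-Tukey butterfly layer written as independent lane operations is exactly the pattern GCC/Clang at `-O3 -march=rv64gcv` tend to turn into clean RVV code:

```c
#include <stdint.h>
#include <assert.h>

#define Q 3329 /* the ML-KEM modulus */

/* One Cooley-Tukey butterfly layer over a block of 2*len coefficients.
 * Every iteration is independent of the others, so the compiler can
 * vectorize the loop without any dependence analysis heroics. */
static void butterfly_layer(int16_t *r, int16_t zeta, int len)
{
  for (int j = 0; j < len; j++)
  {
    int16_t t = (int16_t)(((int32_t)zeta * r[j + len]) % Q);
    int16_t a = r[j];
    r[j] = (int16_t)((a + t) % Q);
    r[j + len] = (int16_t)((a - t + Q) % Q); /* keep result in [0, Q) */
  }
}
```

A layer-merged NTT chains several such layers inside one loop nest so the coefficients stay in registers between layers, which is what makes the "fastntt3" variants so amenable to autovectorization.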
Yeah, sorry for abusing your CI like that (I wasn't expecting it to be that extensive), I could have just read the documentation. I'll set up this nix thing.
@mjosaarinen Sorry, we should have pointed that out earlier. With the
@mkannwischer @mjosaarinen The RV functional tests in the ci-cross shell don't seem to pass:

```shell
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
% CROSS_PREFIX=riscv64-unknown-linux-gnu- make func OPT=1 AUTO=1 -j32
% EXEC_WRAPPER=qemu-riscv64 make run_func_512 -j32 OPT=1 AUTO=1
qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
ERROR (test/test_mlkem.c,41)
ERROR (test/test_mlkem.c,225)
make: *** [Makefile:53: run_func_512] Error 1
```

Same for
Could you investigate?
The RV64 cross tests in CI are failing. I don't know if this is an issue with the code or the CI setup, but it needs looking into.
It looks like some things are running fine, and some things have build problems. I suspect that the platform auto-detection code (in the C header files + makefiles) is to blame (does failing no_opt mean "no optimization"?). It's quite a complicated CI -- can you tell me what would be the best way to debug this stage of the CI build process locally?
Sure. If you have nix installed, then it's exactly what I posted above: you open the nix shell with

```shell
% nix develop --extra-experimental-features 'nix-command flakes' .#ci-cross
```

Building using the cross compiler:
And then running:
The base command line used by
What is confusing to me is that this happens regardless of whether
If you don't want to be bothered by auto-settings, we can for now set
It looks like the issue is solely due to
it'll fail with
Probably some configuration is missing in the invocation of
Hmm. I'm not sure if qemu is able to pull the standard runtime and dynamic libraries that match the "vector ABI" (the calling convention changes somewhat with vector registers). I had to link the test programs with
Anyway, the code assumes the "vector 1.0" extension, 256-bit VLEN, and the standard "gc" stuff, so that looks correct. I left the Keccak instruction stuff out.
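For reference, user-mode QEMU exposes the vector configuration as CPU properties. A sketch of pinning V 1.0 with a 256-bit VLEN (treat the exact property spellings as an assumption to verify against your QEMU version, as they have shifted across releases):

```shell
# Run a test binary with the V extension enabled and VLEN fixed to 256 bits.
qemu-riscv64 -cpu rv64,v=true,vlen=256,elen=64 \
  test/build/mlkem512/bin/test_mlkem512

# Equivalently via the QEMU_CPU environment variable, useful when qemu
# is invoked indirectly (e.g. through EXEC_WRAPPER in the build system):
QEMU_CPU=rv64,v=true,vlen=256 qemu-riscv64 test/build/mlkem512/bin/test_mlkem512
```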
I adjusted the CI. Most tests pass, but there are some issues in the ACVP tests and the monobuild examples. The ACVP one seems to be an issue in the test script (not expecting parameters in the exec-wrapper), while the monobuild failure seems to be an architecture confusion. I'll take a look.
⚠️ Performance Alert ⚠️
Possible performance regression was detected for benchmark 'Intel Xeon 3rd gen (c6i)'.
Benchmark result of this commit is worse than the previous benchmark result, exceeding threshold 1.03.

| Benchmark suite | Current: 01bab43 | Previous: 5aaca94 | Ratio |
|---|---|---|---|
| ML-KEM-768 encaps | 31969 cycles | 30015 cycles | 1.07 |

This comment was automatically generated by workflow using github-action-benchmark.
I was wrong. I tried out what seemed like the straightforward mulcache implementation to me:

```c
void mlk_rv64v_poly_mulcache_compute(int16_t x[MLKEM_N / 2],
                                     const int16_t y[MLKEM_N])
{
#include "rv64v_zetas_basemul.inc"
  size_t vl = __riscv_vsetvl_e16m1(MLKEM_N) / 2;
  size_t i;
  for (i = 0; i < MLKEM_N; i += 2 * vl)
  {
    vint16m1_t b1 =
        __riscv_vget_v_i16m1x2_i16m1(__riscv_vlseg2e16_v_i16m1x2(&y[i], vl), 1);
    vint16m1_t z = __riscv_vle16_v_i16m1(&roots[i / 2], vl);
    vint16m1_t bt = fq_mul_vv(b1, z, vl);
    __riscv_vse16_v_i16m1(&x[i / 2], bt, vl);
  }
}

static inline void mlk_rv64v_poly_basemul_mont_add_k(int16_t *r,
                                                     const int16_t *a,
                                                     const int16_t *b,
                                                     const int16_t *bc,
                                                     unsigned kn)
{
  size_t vl = __riscv_vsetvl_e16m1(MLKEM_N) / 2;
  size_t i, j;
  for (i = 0; i < MLKEM_N; i += 2 * vl)
  {
    vint32m2_t acc0 = __riscv_vmv_v_x_i32m2(0, vl);
    vint32m2_t acc1 = __riscv_vmv_v_x_i32m2(0, vl);
    for (j = 0; j < kn; j += MLKEM_N)
    {
      vint16m1x2_t x = __riscv_vlseg2e16_v_i16m1x2(&a[i + j], vl);
      vint16m1x2_t y = __riscv_vlseg2e16_v_i16m1x2(&b[i + j], vl);
      vint16m1_t x0 = __riscv_vget_v_i16m1x2_i16m1(x, 0);
      vint16m1_t x1 = __riscv_vget_v_i16m1x2_i16m1(x, 1);
      vint16m1_t y0 = __riscv_vget_v_i16m1x2_i16m1(y, 0);
      vint16m1_t y1 = __riscv_vget_v_i16m1x2_i16m1(y, 1);
      vint16m1_t yt = __riscv_vle16_v_i16m1(&bc[(i + j) / 2], vl);
      vint32m2_t x0y0 = __riscv_vwmul_vv_i32m2(x0, y0, vl);
      vint32m2_t x0y1 = __riscv_vwmul_vv_i32m2(x0, y1, vl);
      vint32m2_t x1y0 = __riscv_vwmul_vv_i32m2(x1, y0, vl);
      vint32m2_t x1yt = __riscv_vwmul_vv_i32m2(x1, yt, vl);
      acc0 = __riscv_vadd_vv_i32m2(acc0, x0y0, vl);
      acc0 = __riscv_vadd_vv_i32m2(acc0, x1yt, vl);
      acc1 = __riscv_vadd_vv_i32m2(acc1, x0y1, vl);
      acc1 = __riscv_vadd_vv_i32m2(acc1, x1y0, vl);
    }
    __riscv_vsseg2e16_v_i16m1x2(
        &r[i],
        __riscv_vcreate_v_i16m1x2(fq_redc2(acc0, vl), fq_redc2(acc1, vl)), vl);
  }
}
```

But this is notably slower. Interestingly, even ignoring the mulcache computation itself (which is almost twice as slow as
@mjosaarinen What can I learn from this? My guess is that
Signed-off-by: Hanno Becker <[email protected]>
Three minor comments/questions, but none of them is blocking.
Happy to go ahead and merge this PR and revisit it in a follow-up if needed.
Thank you @mjosaarinen for contributing and thank you @hanno-becker for all the work you have put into this!
…sics.)
Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
- Don't skip over 'opt' tests in CI when testing RV64
- Configure qemu to use 256-bit VL. Once we support other VLs, the invocation of qemu needs to be generalized accordingly.

Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Matthias J. Kannwischer <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Markku-Juhani O. Saarinen <[email protected]>
This commit makes minor changes to the RV64 backend to make the non-NTT component VLEN-agnostic. This mostly means using __riscv_vsetvl_e16m1 instead of a hardcoded VLEN, plus some added tail handling in rejection sampling. For the NTT and invNTT, the implementation is specific to VLEN=256 and falls back to the C implementation for other vector lengths. This should be improved in the future. The compile-time guard for the RV64 backend is changed to check for RVV support only, without a specific VLEN. Similarly, the default build flags use 128 rather than 256 as the minimum VLEN. The README is adjusted accordingly. We also remove the warning regarding the code being experimental, as it has by now undergone thorough review and testing. Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
The backend API requires a bound of 8 * MLKEM_Q on the absolute value of the output coefficients of the NTT. This bound is already satisfied by the current implementation (which is aligned with the AArch64 and x86_64 backends) and hence no final reduction is needed. Signed-off-by: Hanno Becker <[email protected]>
A plain poly add/sub is not part of the backend API because we trust autovectorization to handle it. The corresponding functions mlk_rv64v_poly_add and mlk_rv64v_poly_sub are therefore dead code and removed from the backend. Signed-off-by: Hanno Becker <[email protected]>
Signed-off-by: Hanno Becker <[email protected]>
Use high subtraction rather than addition in the Montgomery reduction to avoid the possibility of a carry-in from the low half. cf https://eprint.iacr.org/2018/039. Signed-off-by: Hanno Becker <[email protected]>
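The subtraction-based Montgomery reduction referenced here can be sketched in scalar C. The constants match the ML-KEM reference code, but this is the widely used reference pattern rather than necessarily the exact backend code; the high half of `a - t*q` is exact because the low 16 bits cancel by construction of `t`, so no carry can propagate in:

```c
#include <stdint.h>
#include <assert.h>

#define MLKEM_Q 3329
#define QINV 62209 /* q^-1 mod 2^16 */

/* Signed Montgomery reduction: for |a| < 2^15 * q, returns r with
 * r == a * 2^-16 (mod q) and |r| < q. Because t == a * q^-1 mod 2^16,
 * the value a - t*q is divisible by 2^16, so the shift is exact. */
static int16_t montgomery_reduce(int32_t a)
{
  int16_t t = (int16_t)((int16_t)a * QINV); /* low-half product */
  return (int16_t)((a - (int32_t)t * MLKEM_Q) >> 16);
}
```

The "high subtraction" formulation computes exactly this high half directly, avoiding the need to materialize (and carry out of) the low half at all.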
Previously, the invNTT would keep the coefficients in the unsigned canonical range [0,MLKEM_Q). This commit changes this to a lazy reduction strategy following the AArch64 and x86_64 backends. Bounds are tracked coarsely and reductions are introduced where necessary. The reduction is certainly not optimal, but further improvements have diminishing returns while complicating review. Bounds assertions (debug only) are introduced to check the bounds at runtime. Signed-off-by: Hanno Becker <[email protected]>
For VLEN >= 512, there are tail iterations in the rejection handling loop where we require fewer coefficients than fit into a vector, requiring an adjustment of the dynamic VL. The previous code re-evaluated the dynamic VL in every iteration, which incurred a significant runtime cost. This commit instead splits the rejection sampling loop into two nested loops, where the inner loop proceeds with a fixed VL and the outer loop re-evaluates the VL. For VLEN <= 256, there is only one iteration of the outer loop, rendering it as efficient as the original version. Signed-off-by: Hanno Becker <[email protected]>
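The loop split described in this commit can be modelled in plain scalar C. `VLMAX` and all names below are illustrative stand-ins for the hardware vector length and the backend's identifiers, not the actual code:

```c
#include <stddef.h>
#include <assert.h>

#define VLMAX 16 /* stand-in for the hardware vector length */

/* Process n elements. The inner loop runs at a fixed chunk size vl,
 * and the outer loop recomputes vl only when a tail remains; this
 * mirrors hoisting the vsetvl out of the hot loop. Returns how often
 * the chunk size had to be recomputed. */
static size_t process_split(size_t n)
{
  size_t done = 0, recomputes = 0;
  while (done < n)
  {
    size_t vl = (n - done < VLMAX) ? (n - done) : VLMAX; /* "vsetvl" */
    recomputes++;
    while (n - done >= vl) /* fixed-vl inner loop */
    {
      /* ... consume vl elements here ... */
      done += vl;
    }
  }
  return recomputes;
}
```

When n is a multiple of the full chunk size, the outer loop runs exactly once, matching the claim that for VLEN <= 256 the split loop is as efficient as the original.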
Signed-off-by: Hanno Becker <[email protected]>
This is required by `-Wconversion`, and is indeed worth a comment since a cast from `unsigned` to `int` can in theory overflow. Also, unify unsigned types used in `rej_uniform` to `unsigned`, and add explicit casts where the RVV intrinsics return `unsigned long`. Signed-off-by: Hanno Becker <[email protected]>
Many thanks @mjosaarinen, this is a great addition. When do you think you'll get to writing a Keccak backend, and adding [inv]NTTs for other VLENs?
Summary:
rv64v support (RISC-V vector extension 1.0, which is available on newer application-class silicon).

Do you expect this change to impact performance: Yes/No
Yes (RISC-V only).

If yes, please provide local benchmarking results.
Roughly 2.5x perf on silicon with vector hardware.